VoiceXML replaces the familiar HTML interpreter (Web browser) with a
VoiceXML interpreter and the mouse and keyboard with the human voice.



Until recently, the Web delivered information and services
exclusively through visual interfaces on computers with displays,
keyboards, and pointing devices. The Web revolution largely bypassed
the huge market for information and services represented by the
worldwide installed base of telephones, for which voice input and
audio output provide the sole means of interaction.



Development of speech services has been hindered
by a lack of easy-to-use standard tools for
managing the dialogue between user and service.
Interactive voice-response systems are characterized
by expensive, closed application-development
environments. Lack of tools inhibits portability of
applications and limits the availability of skilled
application developers. Consequently, voice
applications are costly to develop and deploy, so voice
access is limited to only those services for which the
business case is most compelling for voice access.

Here, I offer an introduction to VoiceXML, an 
emerging standard XML-based markup language 
for distributed Web-based voice services, much as 
HTML is a language for distributed visual services. 
VoiceXML brings the power of Web development
and content delivery to voice-response applications,
freeing developers from low-level programming and 
resource management. It also enables integration of 
voice services with data services, using the familiar 
client/server paradigm and leveraging the skills of 
Web developers to speed application development for 
this new medium. 

VoiceXML 1.0 was developed by the VoiceXML 
Forum (see www.voicexml.org), which released it in 
March 2000, and was accepted by the World Wide 
Web Consortium (W3C) two months later as the 
basis for developing a W3C dialogue markup lan- 
guage (see www.w3.org/Voice/). The initial version 
of the language included robust support for basic 
state-based dialogue capabilities, using a design with 
simple form-based natural language capabilities that 
leaves room to grow as the technology evolves. (You 
can find the VoiceXML specification at
www.voicexml.org and a free implementation for
personal use at
www.alphaworks.ibm.com/tech/voiceserversdk.)

Basic Spoken Dialogue 

The basic spoken dialogue capabilities of VoiceXML
are illustrated by the following VoiceXML
document containing a menu and a form:

<?xml version="1.0"?>
<vxml version="1.0">
  <menu>
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="http://www.sports.example/sports.vxml">
      Sports scores
    </choice>
    <choice next="http://www.weather.example/weather.vxml">
      Weather information
    </choice>
    <choice next="#login">
      Log in
    </choice>
  </menu>
  <form id="login">
    <field name="phone_number" type="phone">
      <prompt>Please say your complete phone number</prompt>
    </field>
    <field name="pin_code" type="digits">
      <prompt>Please say your PIN code</prompt>
    </field>
    <block>
      <submit next="/servlet/login"/>
    </block>
  </form>
</vxml>



This document enables a dialogue like the following 
one in which the user selects an item from a menu: 



C (computer): Say one of: Sports scores; Weather information; Log in.
H (human): Sports scores 

The computer then retrieves and interprets a
VoiceXML document from www.sports.example/sports.vxml
containing a specification of the next
segment of the dialogue (in this case, a sports
information service). The dialogue uses the menu element
(but not the form element). It also illustrates a
capability similar to the one provided for visual
applications by a set of static HTML hyperlinks. However,
the static linking of information is only the Web's
most basic function. The Web's most compelling
feature is its dynamic distributed services, which require
forms.

The form is VoiceXML's basic dialogue unit,
describing a set of inputs (fields) needed from the user 
to complete a transaction between the user agent 
(browser) and a server. Each field includes a prompt 
and a specification of what the user is allowed to say 
to provide the required input. The form also specifies 
what to do with the set of fields after they are col- 
lected. The following dialogue between a computer 
and a human uses both the menu and the form in the 
example VoiceXML document: 

C: Say one of: Sports scores; Weather information; Log in.
H: Log in. 

C: Please say your complete phone number 

H: 914-555-1234 

C: Please say your PIN code 

H: 1 2 3 4 

The computer has now collected the two fields 
needed to complete the login, so it executes the code 
block containing a submit command, thus causing 
the information collected to be submitted to a server 
for processing. Each field specifies the set of accept- 
able user responses. Limiting these responses serves 
two purposes: 

• To allow responses to be verified (and help to be
provided in case of an invalid response) locally, without
the delay of a round trip over the network to the
application server; and

• To help achieve good speech-recognition accuracy,
particularly over a relatively low-quality channel
like a telephone.

In the example VoiceXML document, the set of
acceptable user inputs is specified implicitly by
specifying a "type" attribute ("phone" and "digits" in the
example) in the field element. VoiceXML interpreters
provide built-in support for a set of common field
types, including number, digits, phone, date, and
time. However, applications would be seriously
constrained if they were limited to only these built-in
types. VoiceXML applications may specify their own
field types using grammars, which enumerate a set of
phrases in compact form. The following example
VoiceXML document illustrates the use of grammars
in an online voice-enabled restaurant application:

<form>
  <field name="drink">
    <prompt>What would you like to drink?</prompt>
    <grammar>
      coffee | tea | orange juice | milk | nothing
    </grammar>
  </field>
  <field name="sandwich">
    <prompt>What sandwich would you like?</prompt>
    <grammar src="sandwiches.gram"/>
  </field>
  <block>
    <submit next="/servlet/order"/>
  </block>
</form>



The grammars here are specified using the Java
Speech Grammar Format (JSGF) (see
java.sun.com/products/java-media/speech/). The
first one is inline and consists of a list of words and
phrases ("coffee," "tea," and so on) the user may say
in response to the prompt for that field. The second
is contained in an external file called
"sandwiches.gram":

<ingredient> = ham | roast beef | tomato | lettuce | swiss [cheese]; 
<bread> = rye | white | whole wheat; 

public <sandwich> = <ingredient> ([and] <ingredient>)* on <bread>; 

This grammar consists of three rules: The first,
labeled <ingredient>, specifies a list of phrases
listing sandwich ingredients. The last phrase
("swiss [cheese]" in this case) in the list uses square
brackets to indicate the word "cheese" is optional; thus,
the user may say either "swiss" or "swiss cheese." The
second rule, labeled <bread>, specifies a list of phrases
naming breads. And the third rule, labeled <sandwich>,
specifies that a complete description of a
sandwich consists of at least one ingredient, followed
by zero or more additional ingredients optionally
separated by the word "and" and ending finally with the
word "on" followed by the name of a bread. This last
rule is marked "public," indicating it defines the
phrases the user can actually say; the first two rules
are used only in the formation of the <sandwich>
rule. (For more on grammars, a good starting
point is the JSGF reference manual at
java.sun.com/products/java-media/speech/.)

A typical dialogue between a computer and a 
human enabled by this form might be: 

C: What would you like to drink? 
H: Orange juice 

C: What sandwich would you like? 
H: Ham, lettuce, and swiss on rye 

The computer has now collected the two fields 
needed to complete the order, so it executes the code 
block containing a submit command, thus caus- 
ing the information collected to be submitted to a 
server for processing. 

VoiceXML and Web Standards 

As the first line of the menu-and-login example document
indicates, VoiceXML is an XML application, meaning it
adheres to the XML standard that at its core speci- 
fies the meta-delimiters <, </, >, =, and " (see 
www.w3.org/TR/1998/REC-xml-19980210). The 
rest of the contents of the example are specific either 
to VoiceXML ("vxml," "menu," "prompt," "choice," 
and "next") or to the application ("Say one of:" and 
"Sports scores"). 

Basing VoiceXML on the XML standard yields
some important benefits. The most important is that it
allows the reuse and easy retooling of existing tools for
creating, transforming, and parsing XML documents.
It also allows VoiceXML to make use of other
complementary XML-based standards. For example,
VoiceXML applications occasionally need to specify
speech-synthesis parameters, such as volume,
speaking rate, and pitch. For this purpose, VoiceXML
incorporates the XML markup from the Java Speech
Markup Language, an industry standard for speech
synthesis markup (see
java.sun.com/products/java-media/speech/).

The Distributed Model 

The Web brings to each user a worldwide array of 
information and services while bringing each infor- 
mation and service provider a worldwide customer 
base. Thus, a distributed application model is fun- 
damental to the Web; VoiceXML builds on the same 
distributed model that has already proved so suc- 
cessful for visual Web-based services. Figure 1 out- 
lines the distributed Web-based application model 
used by VoiceXML services accessed by telephone. 

The VoiceXML architecture is the same as the one
in the more familiar visual Web application model,
except the HTML interpreter (Web browser) is
replaced by a VoiceXML interpreter, and voice
replaces the mouse and keyboard as the user-interface
medium. In addition to its core capabilities,
VoiceXML provides more advanced features,
including local validation and processing, audio playback
and recording, and support for context-specific and
tapered help and for reusable subdialogues.

Figure 1. The distributed Web-based application model used by
VoiceXML services accessed by telephone. (Diagram labels: User,
audio, telephone network, VoiceXML interpreter, HTTP, Internet,
server.)

COMMUNICATIONS OF THE ACM September 2000/Vol. 43, No. 9

Local processing and validation of user input is 
accomplished through a collection of elements pro- 
viding a more-or-less standard programming 
model. A "block" element allows code to be run at 
any point in the process of collecting inputs. A 
"filled" element allows input validation code to gain 
control upon completion of any set of user inputs; 
this element is particularly useful for the mixed-ini- 
tiative dialogue model in which the user is able to 
supply inputs in any order. Finally, a "script"
element allows ECMAScript (also known
as JavaScript) program fragments to be run locally
at any point in the dialogue (see
www.ecma.ch/stand/ecma-262.htm).
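As a minimal sketch of these three elements working together, the following document validates a PIN code locally before anything is sent to the server. The isValidPin function, the four-digit rule, and the wording of the prompts are assumptions made for illustration; they are not part of the login example shown earlier:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical helper: treat a PIN as valid if it has exactly four digits. -->
  <script>
    function isValidPin(pin) {
      return pin.length == 4;
    }
  </script>
  <form id="login">
    <!-- A block runs its content when the form reaches it. -->
    <block>
      <prompt>Welcome.</prompt>
    </block>
    <field name="pin_code" type="digits">
      <prompt>Please say your PIN code</prompt>
      <!-- The filled element gains control once pin_code has a value. -->
      <filled>
        <if cond="!isValidPin(pin_code)">
          <prompt>A PIN code has four digits.</prompt>
          <!-- Clearing the field causes the user to be prompted again. -->
          <clear namelist="pin_code"/>
        </if>
      </filled>
    </field>
    <block>
      <submit next="/servlet/login" namelist="pin_code"/>
    </block>
  </form>
</vxml>
```

In this sketch an invalid PIN never leaves the interpreter, so the user is corrected without a round trip to the application server.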

The playback of prerecorded audio prompts is 
accomplished through an "audio" element. Recording 
of user messages is done through the "record" ele- 
ment; the recorded audio may then be played back 
locally using the "audio" element or uploaded to the 
server for storage, processing, or playback at a later 
time. 
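A short sketch of playback and recording might look like the following. The file name welcome.wav, the /servlet/messages address, and the 30-second limit are hypothetical values chosen for illustration:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="leave_message">
    <block>
      <prompt>
        <!-- Play a prerecorded prompt; the enclosed text is a fallback
             to be spoken if the audio file cannot be played. -->
        <audio src="welcome.wav">Welcome to the message service.</audio>
      </prompt>
    </block>
    <!-- Record the user's message into the variable "message". -->
    <record name="message" beep="true" maxtime="30s">
      <prompt>Please record your message after the beep.</prompt>
    </record>
    <block>
      <!-- Upload the recording to the server for storage or processing. -->
      <submit next="/servlet/messages" method="post"
              enctype="multipart/form-data" namelist="message"/>
    </block>
  </form>
</vxml>
```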

Meanwhile, context-specific and tapered help is 
provided by a built-in system of events and event 
handlers. VoiceXML defines a set of events corre- 
sponding to, for example, a user request for help, a 
failure by the user to respond within a timeout 
period, or user input that doesn't match an active 
grammar. The application may then provide (in any 
given context, including a form or a field) an event 
handler responding appropriately to a given event for 
a particular context. Moreover, help may be tapered; 
a count may be specified for each event handler so a 
different handler is executed, depending on how 
many times the event has occurred in that context. 
For example, tapering can be used to provide
increasingly detailed messages each time a user asks for
help.
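The following sketch shows how tapered help and the other events described here might be attached to a single field. The wording of the messages is an assumption made for illustration:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="login">
    <field name="pin_code" type="digits">
      <prompt>Please say your PIN code</prompt>
      <!-- First request for help in this field: a brief hint. -->
      <help count="1">
        Say the digits of your PIN code.
      </help>
      <!-- Second and later requests: a more detailed message. -->
      <help count="2">
        Your PIN code is the number you chose when you
        opened your account. Say each digit, one at a time.
      </help>
      <!-- The user did not respond within the timeout period. -->
      <noinput>
        I did not hear you. <reprompt/>
      </noinput>
      <!-- The user's input did not match an active grammar. -->
      <nomatch>
        I did not understand. <reprompt/>
      </nomatch>
    </field>
  </form>
</vxml>
```

Because the handlers sit inside the field, they apply only in that context; a handler placed at the form level would cover every field in the form.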



Finally, VoiceXML provides support for subdialogues.
A subdialogue is an entire form that is executed, the
result of which provides an input field to another form.
This feature has two uses: providing a disambiguation
or confirmation dialogue for an input, and supporting
reusable subdialogues.
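A confirmation subdialogue could be sketched as follows. The form names, the variable name "confirmation," and the choice to place both forms in one document are assumptions made for illustration:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <field name="drink">
      <prompt>What would you like to drink?</prompt>
      <grammar>
        coffee | tea | orange juice | milk | nothing
      </grammar>
    </field>
    <!-- Run the "confirm" form as a subdialogue; the values it
         returns become properties of "confirmation". -->
    <subdialog name="confirmation" src="#confirm">
      <filled>
        <if cond="!confirmation.answer">
          <!-- The user said no: forget both values and ask again. -->
          <clear namelist="drink confirmation"/>
        </if>
      </filled>
    </subdialog>
    <block>
      <submit next="/servlet/order" namelist="drink"/>
    </block>
  </form>

  <!-- A reusable yes/no confirmation dialogue. -->
  <form id="confirm">
    <field name="answer" type="boolean">
      <prompt>Is that correct?</prompt>
      <filled>
        <return namelist="answer"/>
      </filled>
    </field>
  </form>
</vxml>
```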

Compared with HTML 

While VoiceXML reuses many concepts and designs 
from HTML, it differs in several ways due to the 
differences between visual and voice interactions. 
For example, an HTML document is a single unit 
that is fetched from a network resource specified by a 
uniform resource identifier and presented to the user 
all at once. In contrast, a VoiceXML document
contains a number of dialogue units (menus or forms)
presented sequentially. This difference is due to the
visual medium's ability to display a number of items
in parallel, while the voice medium is inherently
sequential.

Thus, although a given VoiceXML document may 
contain the same information as a corresponding 
HTML document, the VoiceXML document is 
structured differently to reflect the sequential nature 
of the voice medium. So, for example, the HTML 
equivalent of the menu in the simple VoiceXML doc- 
ument outlined earlier might be: 

Please select a service.
<a href="http://www.sports.example/sports.html">
  Sports scores
</a>
<a href="http://www.weather.example/weather.html">
  Weather information
</a>
<a href="#login">
  Log in.
</a>

In HTML, there is no need to identify this menu as 
a unit or to isolate it using markup structure from 
other elements on the same page. However, 
VoiceXML requires dialogue elements (menus and 
forms) to be identified as distinct units so they may be 
presented one at a time to the user. Thus, while an 
HTML document functions, in effect, as a single dia- 
logue unit, a VoiceXML document is a container of 
dialogue units, such as menus and forms, each con- 
taining logic to sequence the interpreter to the next 
unit. 

Another consequence of the sequential nature of 
the voice medium is the need for the markup to con- 
tain application logic for sequencing among dialogue 
units. This need is reflected in a tighter integration of
sequential logic elements into VoiceXML than in
HTML. For example, VoiceXML contains markup
elements for sequence control; in HTML, such con- 
trol is available only through the relatively more cum- 
bersome method of scripting. 
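As a small sketch of such sequence control, the following document chooses the next dialogue unit with markup alone. The "route" form, its question, and the signup.vxml document are hypothetical and introduced only for illustration:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="route">
    <field name="member" type="boolean">
      <prompt>Do you already have an account?</prompt>
      <filled>
        <if cond="member">
          <!-- Continue with a form in this same document. -->
          <goto next="#login"/>
        <else/>
          <!-- Continue with a different (hypothetical) document. -->
          <goto next="signup.vxml"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="login">
    <field name="pin_code" type="digits">
      <prompt>Please say your PIN code</prompt>
    </field>
    <block>
      <submit next="/servlet/login"/>
    </block>
  </form>
</vxml>
```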

Natural Dialogue 

VoiceXML supports simple "directed" dialogues; the 
computer directs the conversation at each step by 
prompting the user for the next piece of informa- 
tion. Dialogues between humans don't operate on 
this simple model, of course. In a natural dialogue, 
each participant may take the initiative in leading 
the conversation. A computer-human dialogue mod- 
eled on this idea is referred to as a "mixed-initiative" 
dialogue, because either the computer or the human 
may take the initiative. 

The field of spoken interfaces is not nearly as 
mature as the field of visual interfaces, so standardiz- 
ing an approach to natural dialogue is more difficult 
than designing a standard language for describing 
visual interfaces like HTML. Nevertheless, 
VoiceXML takes some modest steps toward allowing 
applications to give users some degree of control over 
the conversation. 

In the forms described earlier, the user was asked to 
supply (by speaking) a value for each field of a form in 
sequence. The set of phrases the user could speak in 
response to each field prompt was specified by a
separate grammar for each field. This approach allowed
the user to supply only one field value at a time, in
sequence. Consider a form for airline travel reservations in which the
user supplies a date, a city to fly from, and a city to fly 
to. A directed dialogue conversation for completing 
such a form might proceed as follows: 

C: On what date do you wish to fly? 

H: February 29th. 

C: From what city? 

H: New York. 

C: To what city? 

H: Chicago. 



In contrast, a somewhat more natural dialogue 
might proceed as follows: 

C: How can I help you? 

H: I'd like to fly from New York on February 29th.
C: Where would you like to fly to? 
H: To Chicago. 

VoiceXML enables such relatively natural dialogues 
by allowing input grammars to be specified at the 
form level, not just at the field level. A form-level 
grammar for these applications defines utterances that 
allow users to supply values for a number of fields in 
one utterance. For example, the utterance "I'd like to
fly from New York on February 29th" supplies values
for both the "from city" field and the "date" field.
VoiceXML specifies a form-interpretation algorithm 
that then causes the browser to prompt the user for 
the values (one by one) of missing pieces of informa- 
tion (in this example, the "to city" field). 
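A form for the airline example might be sketched as follows. The grammar files flight.gram and cities.gram, the field names, and the /servlet/reserve address are hypothetical; the sketch also assumes the form-level grammar is written so that what it recognizes is mapped to the fields named here:

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="reservation">
    <!-- Form-level grammar: one utterance may fill several fields. -->
    <grammar src="flight.gram"/>

    <!-- Opening prompt that invites the user to take the initiative. -->
    <initial name="start">
      <prompt>How can I help you?</prompt>
    </initial>

    <!-- Any field still empty afterward is prompted for in turn. -->
    <field name="date" type="date">
      <prompt>On what date do you wish to fly?</prompt>
    </field>
    <field name="from_city">
      <prompt>From what city?</prompt>
      <grammar src="cities.gram"/>
    </field>
    <field name="to_city">
      <prompt>Where would you like to fly to?</prompt>
      <grammar src="cities.gram"/>
    </field>

    <block>
      <submit next="/servlet/reserve"/>
    </block>
  </form>
</vxml>
```

With the utterance from the dialogue above, the date and from_city fields would be filled at once, leaving only the to_city prompt to be played.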

VoiceXML's special ability to accept free-form
utterances is only a first step toward natural dialogue.
VoiceXML will continue to evolve, incorporating
more advanced features in support of natural
dialogue.